ABSTRACT

Augmented reality (AR) displays have recently become increasingly
popular because they are highly intuitive for humans and because
high-quality head-mounted displays have developed rapidly. To
achieve such displays with augmented information, highly accurate
image registration or ego-positioning is required, but little
attention has been paid to outdoor environments. This paper presents
a method for ego-positioning in outdoor environments with low-cost
monocular cameras. To reduce the computational and memory
requirements as well as the communication overheads, we formulate
the model compression algorithm as a weighted k-cover problem to
better preserve model structures. For real-world vision-based
positioning applications, we consider large scene changes and
propose a model update algorithm to handle them. A long-term
positioning dataset covering more than one month, 106 sessions, and
14,275 images is constructed. Based on the local and up-to-date
models constructed in our approach, extensive experimental results
show that high positioning accuracy (mean ~ 30.9 cm, stdev. ~ 15.4 cm)
can be achieved, which outperforms existing vision-based algorithms.


Categories and Subject Descriptors 

I.2.10 [Artificial Intelligence]: Vision and Scene Understanding -
3D/stereo scene analysis; I.4.9 [Image Processing and Computer
Vision]: Applications


General Terms 

Algorithms, Performance, Experimentation 

Keywords 

Outdoor Positioning, Model Compression, Model Update,
Long-Term Positioning Dataset

1. INTRODUCTION 

Ego-positioning aims at locating an object in a global coordinate
system based on the sensors mounted on the object. With the growth
of mobile and wearable devices, highly accurate positioning becomes
increasingly important. For example, Microsoft HoloLens [24] uses AR
technologies to display information in a natural way and enables
users to interact with three-dimensional holograms. Pioneer NavGate
HUD [29] augments traffic and navigation information on a head-up
display. Figure 1 shows two more examples of AR applications on
mobile devices that become possible when highly accurate positioning
is available. Such positioning makes it possible to display
information on the corresponding area of the image and helps users
connect virtual information with the real world. Intuitive and
fine-grained navigation suggestions can be provided instead of macro
suggestions such as "turn right at the next intersection", which are
provided by traditional navigation systems but are sometimes
difficult to understand when the road layout is complicated. To
achieve such AR displays, highly accurate image registration or
ego-positioning is necessary.

Unlike indoor positioning [7, 8, 15, 20, 23], highly accurate
ego-positioning for outdoor environments has received less effort,
and GPS (Global Positioning System) is still the most widely used
positioning technology. However, the precision of GPS sensors is
around 3 to 20 meters [6, 26], and existing systems do not perform
well in urban areas full of high-rises. Although several positioning
methods based on expensive sensors such as radar and Velodyne 3D
laser scanners can achieve high accuracy, it is currently not
practical to deploy them en masse, especially on low-cost mobile
devices.



Figure 1: Examples of applications enabled by highly accurate positioning: (a) a natural way to provide
building, shopping, or navigation information with AR technologies on a mobile phone or head-mounted
display, and (b) fine-grained navigation, such as lane changes, on the head-up display of a vehicle.



Figure 2: Overview of the proposed vision-based
positioning system. SIFT features from images acquired
on vehicles are matched against previously constructed
3D models for ego-positioning. In addition, the newly
acquired images are used to update the 3D models.


In this paper, we introduce a vision-based positioning algorithm
based on low-cost monocular cameras within the IoT
(Internet-of-Things) framework. It exploits visual information from
the images captured by mobile devices to achieve high positioning
accuracy even in crowded traffic situations. An overview of the
proposed algorithm is shown in Figure 2.

To this end, our approach includes a training phase and an
ego-positioning phase. In the training phase, if a person or vehicle
passes a specific area (e.g., an intersection) that can be roughly
positioned by GPS, the captured images are uploaded to a cloud
system. After many persons or vehicles have passed through that
area, the cloud system has collected a sufficient number of images
to construct a local 3D point set model of the area by using a
structure-from-motion algorithm [33]. In the ego-positioning phase,
the device carried by the person or vehicle downloads all the local
3D scene models along the route. En route, the GPS system informs
the person or vehicle of its rough position, and the mobile device
matches its current image with the corresponding local 3D model for
ego-positioning. Notice that our system only needs the device to
download the scene models in the ego-positioning phase, and to
upload its own images for model construction and update when
wireless bandwidth is available.

Model-based positioning [18, 21, 30, 31] has been studied in recent
years, but most approaches construct a single city-scale or
closed-world model and focus on fast correspondence search. Such a
large-scale model makes the correspondence matching task difficult;
as a result, the rate of successfully registered images is under 70%
and high positioning accuracy is difficult to achieve (the
positioning error has a median of 1.6 m and a mean of 5.5 m in [18],
and a median of 1.3 m and a mean of 15 m in [30]). Furthermore,
model-based positioning approaches suffer from two main problems:
how to build a large-scale (or world-scale) model and how to keep
the model up to date.

The main contributions of this work are summarized as follows.
First, we propose a novel algorithm within the IoT framework. By
collecting new observations from persons and vehicles that
previously passed through an area, it addresses the above-mentioned
problems regarding model construction and makes model-based
positioning practical. A local and up-to-date model is constructed
that facilitates sub-meter positioning accuracy. Second, we develop
a novel model compression algorithm, formulated as a weighted
k-cover problem, that preserves model structure. Third, we present a
model update method and construct a dataset of more than 14,000
images from 106 sessions collected over more than one month
(including sunny, cloudy, rainy, and night scenes). To the best of
our knowledge, the proposed system is the first to consider large
outdoor scene changes over a long period of time for model-based
positioning with sub-meter accuracy.

2. RELATED WORK 

Vision-based positioning aims to match current images with stored
images or pre-constructed models for relative or global pose
estimation. Such algorithms can be categorized into three types
according to the registered data used for matching, i.e.,
consecutive frames, images in a database, and 3D models. In
addition, we discuss related work on compression of 3D models and
long-term positioning.

2.1 Vision-Based Positioning 

2.1.1 Matching with Consecutive Frames 

Positioning methods that match the current frame with previous
frames to estimate relative motion are widely used in robotics
[14, 16, 32]; this approach is also known as visual odometry. These
methods combine odometry sensor readings and visual data from a
monocular or stereo camera to estimate local motion incrementally.
The main drawbacks are that only relative motion can be estimated
and that the accumulated error is large (about 1 meter for every 200
meters of movement [32]) because of drift. In [2], Brubaker et al.
incorporate road maps to alleviate the accumulated error. However,
the positioning accuracy is low (i.e., objects are localized with an
accuracy of up to 3 meters).

2.1.2 Matching with Database Images 

Methods in this category match images with those in a database to
determine the current position [4, 37]. Such approaches are usually
used in multimedia applications where accuracy is not of prime
importance, such as non-GPS-tagged photo localization and landmark
identification. The positioning accuracy is usually low (between 10
and hundreds of meters).

2.1.3 Matching with 3D Models 

Methods in this category are based on a constructed 3D model. Arth
et al. [1] propose a method that combines sparse 3D reconstructions
and manually determined visibility information to estimate camera
poses in an indoor environment. Wendel et al. [35] extend this
method to a micro aerial vehicle for localization in outdoor
scenarios. Both methods are evaluated in limited spatial domains and
within a short amount of time.

For large-scale localization, one main challenge is how to match
features efficiently and effectively. Sattler et al. [30] propose a
direct 2D-to-3D matching method based on visual vocabulary
quantization and a prioritized correspondence search to speed up the
feature matching process. This method is extended to both 2D-to-3D
and 3D-to-2D search for more efficiency [31]. Lim et al. [19]
propose to extract more efficient descriptors for every 3D point
across multiple scales for scale invariance in feature matching. Li
et al. [18] further deal with a worldwide-scale problem that
includes hundreds of thousands of images and tens of millions of 3D
points. Although these methods focus on efficient correspondence
search, the sheer number of points in the 3D point cloud model leads
to a registration rate of 70% and a mean localization error of 5.5
meters [18]. These methods are thus not applicable to AR displays on
mobile devices because of their positioning accuracy and their
computational and memory requirements.

Recently, methods that combine local visual odometry with global
model-based positioning have been proposed [25, 34]. These systems
estimate relative poses of mobile devices and carry out global
image-based localization on remote servers to overcome the problem
of accumulated errors.

However, none of these model-based positioning approaches considers
outdoor scene changes, and they are evaluated on images taken in the
same session in which the model is constructed. In this paper, we
present a system that deals with large scene changes and updates
models for long-term positioning within the IoT context.

Figure 3: Overview of the proposed vision-based positioning system.
The blue dotted lines denote wireless communication links.

2.2 Model Compression 

In addition to positioning, compression methods have been developed
to reduce the computational and memory requirements of large-scale
models [3, 10, 17, 28]. Irschara et al. [10] and Li et al. [17] use
a greedy algorithm to solve a set cover problem that minimizes the
number of 3D points while ensuring a high probability of successful
registration. Park et al. [28] formulate the set cover problem as a
mixed integer quadratic programming problem to obtain an optimal 3D
point subset. Cao and Snavely [3] further take into account both
coverage and distinctiveness and propose a probabilistic approach
that yields better performance. However, these approaches do not
consider structure information and suffer from an uneven spatial
distribution of the selected points, which leads to worse pose
estimation. To overcome this problem, this paper proposes a weighted
set cover formulation that ensures the retained 3D points are fairly
distributed over all planes and lines, thereby facilitating accurate
positioning.

2.3 Long-Term Positioning 

Little attention has been paid to long-term positioning. Although
some works in robotics [11, 12] consider outdoor scene changes for
localization and mapping, they only match 2D query images with
dataset images collected at different times. A match is considered
correct if the distance between the two images is less than 10 or 20
meters by GPS measurement; these works do not estimate 3D absolute
positions as we do in this paper.

3. SYSTEM OVERVIEW 

As discussed in Section 1, the proposed system consists of two
phases. Figure 3 shows the system overview. The training phase is
carried out on local machines or cloud systems. It includes three
main components, all of which are processed off-line in batches.
First, image-based modeling automatically builds a 3D point cloud
model from the collected images. Second, structure preserving model
compression reduces the model size, not only to reduce the
computational and memory requirements [3, 10, 17, 28] but also to
minimize communication overheads for IoT systems, as these models
need to be transmitted from a cloud server to mobile devices. Third,
the model is updated with newly arrived images. In this paper, a
model pool is constructed for model update (see Section 4.4).

The ego-positioning phase is carried out on mobile devices. The
2D-to-3D image matching and localization component matches images
from a camera with the corresponding 3D point cloud model of the
area (roughly localized by GPS) to estimate 3D positions. Notice
that there are two communication links in the proposed system. One
is used to download 3D point cloud models to a mobile device; the
models can also be preloaded when a fixed walking or driving route
is given. However, the size of each uncompressed 3D point model
(ranging from 12.8 MB to 105.1 MB in our experiments) is still a
crucial problem, as numerous models are required for real-world
long-range trips. The other link is used to upload images to a cloud
server for model update, which is not required frequently.

4. VISION-BASED POSITIONING SYSTEM 

As explained in Section 3, there are four main components 
in our system which are described in the following sections. 

4.1 Image-Based Modeling 

Image-based modeling aims to construct a 3D model from a number of
input images [5]. One well-known image-based modeling system is the
Photo Tourism method [33], which uses structure from motion to
estimate camera poses and 3D scene geometry from images. We use this
method to build the 3D point cloud model. After images from mobile
devices in a local area are acquired, the SIFT [22] feature
descriptors of each pair of images are matched using an approximate
nearest neighbor kd-tree. A fundamental matrix is then estimated
using RANSAC to remove outliers that violate geometric consistency.

Next, an incremental structure from motion method is used to avoid
poor local minima and to reduce the computational load. It recovers
camera parameters and the 3D locations of feature points by
minimizing the sum of distances between the projections of the 3D
feature points and their corresponding image features, based on the
following objective function:

\[
\sum_{i=1}^{n} \sum_{j=1}^{m} v_{ij} \, d\big(Q(c_j, P_i), \, p_{ij}\big), \tag{1}
\]

where $c_j$ denotes the camera parameters of image $j$; $m$ is the
number of images; $P_i$ denotes the 3D coordinates of feature point
$i$; $n$ is the number of feature points; $v_{ij}$ is a binary
variable that equals 1 if point $i$ is visible in image $j$ and 0
otherwise; $Q(c_j, P_i)$ projects the 3D point $i$ onto image $j$;
$p_{ij}$ is the corresponding image feature of $i$ on $j$; and
$d(\cdot)$ is the distance function. This objective function can be
minimized by using bundle adjustment, and a 3D point cloud model $P$
is built. It includes the positions of the 3D feature points and the
corresponding list of SIFT feature descriptors for each point. An
example of image-based modeling is shown in Figure 4.
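
To make the objective in (1) concrete, the following minimal Python sketch evaluates it for given cameras and points. It is an illustration only: the pinhole parameterization of a camera as a tuple (K, R, t), the Euclidean pixel distance for d, and the function names are assumptions made here, and the sketch evaluates the cost rather than performing bundle adjustment.

```python
import numpy as np

def project(K, R, t, X):
    """Q(c_j, P_i): project (n, 3) world points with an assumed pinhole camera (K, R, t)."""
    Xc = X @ R.T + t                 # world -> camera coordinates
    uv = Xc @ K.T                    # apply the intrinsic matrix
    return uv[:, :2] / uv[:, 2:3]    # perspective division -> pixel coordinates

def reprojection_error(cameras, points, observations, visibility):
    """Sum over i, j of v_ij * d(Q(c_j, P_i), p_ij), with d the Euclidean distance.

    cameras: list of m tuples (K, R, t); points: (n, 3);
    observations: (n, m, 2) image features p_ij; visibility: (n, m) array of 0/1.
    """
    total = 0.0
    for j, (K, R, t) in enumerate(cameras):
        proj = project(K, R, t, points)
        dist = np.linalg.norm(proj - observations[:, j], axis=1)
        total += float(np.sum(visibility[:, j] * dist))
    return total

# Tiny synthetic check with one camera and two points (all values are made up).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [(K, np.eye(3), np.array([0.0, 0.0, 5.0]))]
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 1.0]])
observations = project(K, np.eye(3), np.array([0.0, 0.0, 5.0]), points)[:, None, :]
visibility = np.ones((2, 1))
perfect = reprojection_error(cameras, points, observations, visibility)

shifted = observations.copy()
shifted[0, 0] += np.array([3.0, 4.0])   # displace one feature by a 3-4-5 offset
perturbed = reprojection_error(cameras, points, shifted, visibility)
```

With exact observations the cost is zero, and displacing a single feature by (3, 4) pixels raises it by the length of that offset, which is the quantity bundle adjustment drives down.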

4.2 Structure Preserving Model Compression 



Figure 4: An example of image-based modeling from 
400 images. 


How to condense the model so that it retains the ego-positioning
capability of the full model is a central issue in reducing the
computation, memory, and communication overheads. A feasible way
would be to sort the 3D points by their visibility and keep the most
observed points. However, this might cause a non-uniform spatial
distribution of the points, because the points visible to many views
could be concentrated in a small area of an image. Positioning with
these points thus leads to an inaccurate estimate of the camera
pose. Building on the visibility idea, previous works [10, 17, 28]
formulated the point-selection problem as a set k-cover problem to
find a minimum set of 3D points that ensures at least k points are
visible in each camera view. However, they easily suffer from the
non-uniform spatial distribution within each view discussed above.
The problem becomes particularly serious with the local area models
constructed in our system. To overcome this problem, we propose an
approach that solves a weighted set k-cover problem, where the
weights are chosen so that the selected points are fairly
distributed over all planes and lines in the area. The condensed
model better follows the spatial structure of the scene, which is
crucial for our system to achieve sub-meter accuracy on session
data. Details of our algorithm are given in the following.

Let the reconstructed 3D point cloud be $P = \{P_1, \ldots, P_N\}$,
$P_i \in \mathbb{R}^3$, where $N$ is the total number of points.
First, we detect planes and lines in the 3D point cloud by using the
RANSAC algorithm [36]. This procedure selects three random points to
generate a plane model and evaluates it by counting the number of
points consistent with that model (i.e., with a point-to-plane
distance smaller than a threshold). The random selection is
repeated, and the plane with the highest count is selected as the
detected plane. We simply follow the one-at-a-time principle for
detecting multiple planes. After one plane is detected via RANSAC,
the 3D points belonging to this plane are removed, and the next
plane is detected from the remaining points. This is repeated until
no plane has a sufficient number of consistent points. Line
detection is performed after plane detection in a similar process,
except that two points are selected to generate a line-model
hypothesis in each iteration. An example of plane and line detection
on the model in Figure 4 is shown in Figure 5(a).
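
The one-at-a-time plane detection described above can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the paper: the distance threshold, iteration count, and minimum inlier count are placeholder values, the function names are hypothetical, and line detection (two-point hypotheses) is omitted.

```python
import numpy as np

def fit_plane(p0, p1, p2):
    """Return (unit normal n, offset d) with n.x + d = 0, or None if the points are collinear."""
    n = np.cross(p1 - p0, p2 - p0)
    norm = np.linalg.norm(n)
    if norm < 1e-12:
        return None
    n = n / norm
    return n, -float(n @ p0)

def ransac_plane(points, threshold, iterations, rng):
    """Keep the 3-point hypothesis with the most points closer than `threshold`."""
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iterations):
        idx = rng.choice(len(points), size=3, replace=False)
        model = fit_plane(*points[idx])
        if model is None:
            continue
        n, d = model
        inliers = np.abs(points @ n + d) < threshold
        if inliers.sum() > best.sum():
            best = inliers
    return best

def detect_planes(points, threshold=0.05, iterations=200, min_inliers=50, seed=0):
    """One-at-a-time detection: find a plane, remove its points, repeat on the rest."""
    rng = np.random.default_rng(seed)
    labels = np.full(len(points), -1)      # -1 marks points on no detected plane
    remaining = np.arange(len(points))
    plane_id = 0
    while len(remaining) >= 3:
        inliers = ransac_plane(points[remaining], threshold, iterations, rng)
        if inliers.sum() < min_inliers:
            break                          # no plane with sufficient support is left
        labels[remaining[inliers]] = plane_id
        remaining = remaining[~inliers]
        plane_id += 1
    return labels

# Synthetic scene: 200 points on the plane z = 0 and 100 points on the plane x = 0.
rng = np.random.default_rng(42)
plane_a = np.column_stack([rng.uniform(1, 2, 200), rng.uniform(0, 1, 200), np.zeros(200)])
plane_b = np.column_stack([np.zeros(100), rng.uniform(0, 1, 100), rng.uniform(1, 2, 100)])
labels = detect_planes(np.vstack([plane_a, plane_b]))
```

In this synthetic scene the larger plane is found first, which mirrors the observation that the main structures are detected before minor ones.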

Figure 5: (a) An example of plane and line detection. (b) An example
of a point cloud model after model compression.

Note that, by using RANSAC and the one-at-a-time strategy, the
detected planes and lines are usually distributed over the scene
because the main structures (in terms of planes or lines) are found
first. This is helpful in selecting non-local common-view points for
condensing the 3D point model. The planes and lines detected in the
original point cloud model are denoted as $r_l$ ($l = 1, \ldots, L$),
where $L$ is the total number of planes and lines. The remaining
points, which do not belong to any plane or line, are grouped into a
single category $L + 1$. Given a 3D point $P_i$ in the reconstructed
3D model, we initially set its weight $w_i$ as


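\[
w_i = \begin{cases} \sigma_l, & \text{if } P_i \in r_l, \\ \sigma_{L+1}, & \text{otherwise,} \end{cases} \tag{2}
\]

\[
\sigma_l = |\{P_i \mid P_i \in r_l, \forall i\}| / N, \tag{3}
\]

\[
\sigma_{L+1} = |\{P_i \mid P_i \notin r_l, \forall l, \forall i\}| / N. \tag{4}
\]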


That is, a plane or line that contains a larger share of the points
is given a higher weight for point selection.
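
As a small worked example of (2)-(4), the sketch below computes the initial weights from per-point structure labels. The label encoding (0 to L-1 for planes and lines, -1 for points on no structure) and the function name are assumptions made for illustration.

```python
import numpy as np

def initial_weights(labels, num_structures):
    """Assign each point the fraction of all points that share its plane/line category."""
    labels = np.asarray(labels)
    n = len(labels)
    # Points on no plane or line (label -1) form the extra category with index L.
    category = np.where(labels < 0, num_structures, labels)
    sigma = np.bincount(category, minlength=num_structures + 1) / n
    return sigma[category]

# Six points: three on structure 0, one on structure 1, two on no structure.
weights = initial_weights([0, 0, 0, 1, -1, -1], num_structures=2)
```

Here the three points on the dominant structure each receive weight 3/6, the single point on the second structure receives 1/6, and the two unassigned points receive 2/6.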

Based on the weights given in (2), a weighted set k-cover problem is
formulated. However, the minimum set k-cover problem is NP-hard.
Therefore, we use a greedy algorithm to solve it efficiently. Let
$\{c_1, \ldots, c_m\}$ be the $m$ camera views used for the 3D model
reconstruction in Section 4.1. We aim to select a subset of points
from $P$ such that at least $k$ points are visible to each camera
and the sum of weights of the selected points is maximized.

Following the greedy principle for solving a set covering problem,
our approach iteratively selects the most visible point until at
least $k$ points are selected for every camera. We begin by
selecting one point from $P$. This point (denoted as $P_s$)
satisfies the following criterion:


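\[
P_s = \arg\max_{P_i \in P} \sum_{j=1}^{m} w_i \, v(P_i, c_j), \tag{5}
\]

with

\[
v(P, c_j) = \begin{cases} 1, & \text{if } P \in c_j, \\ 0, & \text{otherwise,} \end{cases} \tag{6}
\]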

being the visibility of the point $P$ from the $j$-th camera. Hence,
$P_s$ found by (5) is the most commonly visible point in terms of
the respective weight. We then remove $P_s$ from $P$ with
$P \leftarrow P \setminus \{P_s\}$ in our greedy approach. Our
approach then keeps finding the most visible point ($P_s$) based on
the new point set $P$, and the procedure is iterated accordingly.

Adaptive weight set k-cover: In the above, we assume that the
weights $\{w_i\}$ are fixed. However, although the planes and lines
found by our approach are usually distributed over the scene, the
selected points may still be concentrated in a local region inside a
single plane or line.


Algorithm 1 Structure preserving model compression
Input: 3D point cloud model $P = \{P_1, \ldots, P_N\}$, cameras
$\{c_1, \ldots, c_m\}$, integer $k$

 1: Initialize the compressed model $M \leftarrow \emptyset$ and the
    number of covered points $C[j] = 0$ for all cameras $c_j$
 2: Detect planes and lines $r_1, \ldots, r_L$ in $P$ by the RANSAC algorithm
 3: Assign weight $w_i$ to each point $P_i$ by (2)
 4: while $C[j] < k$ for some camera $c_j$ do
 5:     Select $P_s$ by (5)
 6:     $M \leftarrow M \cup \{P_s\}$
 7:     $P \leftarrow P \setminus \{P_s\}$
 8:     for all cameras $c_j$ do
 9:         if $P_s \in c_j$ then
10:             $C[j] = C[j] + 1$
11:         end if
12:     end for
13:     for all 3D points $P_i$ do
14:         if $P_s \in r_l$ and $P_i \in r_l$ then
15:             $w_i = w_i / 2$
16:         end if
17:         if $P_i \notin c_j$ for all $j$ s.t. $C[j] < k$ then
18:             $w_i = 0$
19:         end if
20:     end for
21:     Normalize $w_i$
22: end while
23: Return: compressed model $M$



Therefore, we propose to adapt the weights in every iteration of our
greedy algorithm. Once the point $P_s$ is selected in an iteration,
we reduce the selection probability of the plane or line containing
this point: if $P_s \in r_l$, then $w_i$ is divided by 2 for all
points $P_i \in r_l$ in the next iteration. In this way, the planes
or lines from which points have already been selected are
down-weighted, and the other planes or lines have a higher chance of
being selected later. The detailed procedure is given in Algorithm
1. An example of model compression is shown in Figure 5(b). It shows
that the main structure is kept and most of the noisy points are
removed, while the model size is reduced from 105.1 MB to 14.4 MB.
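
The greedy procedure of Algorithm 1 can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the authors' code: visibility is given as a boolean point-by-camera matrix, the plane and line labels are assumed to be precomputed, the residual category is halved like any other category, and points that cannot help any under-covered camera are masked out instead of having their weights set to zero (which has the same effect on the selection).

```python
import numpy as np

def compress_model(visibility, category, k):
    """Greedy adaptive weighted set k-cover; returns indices of the selected points.

    visibility: (N, m) boolean array, True if point i is seen by camera j.
    category:   (N,) non-negative integers, the plane/line index of each point
                (with one extra index for points on no structure).
    """
    visibility = np.asarray(visibility, dtype=bool)
    category = np.asarray(category)
    n_points, n_cams = visibility.shape
    sigma = np.bincount(category) / n_points       # share of points per category
    w = sigma[category].astype(float)              # initial weights
    covered = np.zeros(n_cams, dtype=int)          # C[j] in Algorithm 1
    available = np.ones(n_points, dtype=bool)
    views_per_point = visibility.sum(axis=1)
    selected = []
    while np.any(covered < k):
        need = covered < k
        # Only unselected points seen by at least one under-covered camera qualify.
        useful = available & visibility[:, need].any(axis=1)
        if not useful.any():
            break                                  # some camera sees fewer than k points
        score = np.where(useful, w * views_per_point, -np.inf)
        s = int(np.argmax(score))                  # weighted most-visible point
        selected.append(s)
        available[s] = False
        covered += visibility[s]
        w[category == category[s]] /= 2.0          # down-weight the chosen structure
        w /= w.sum()                               # normalize the weights
    return selected
```

A usage example on random data: `compress_model(vis, cat, k=3)` with a 60-by-5 visibility matrix returns a short list of point indices such that every camera sees at least three selected points whenever it sees at least three points at all.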

Because the 3D point cloud model is constructed in advance, a visual
word tree can be built before the image matching procedure to speed
it up. A visual word is a cluster of similar feature descriptors,
and the visual words are obtained by k-means clustering. Afterwards,
a kd-tree [22] is built with the FLANN library [27] for fast
indexing [30].

4.3 2D-to-3D Image Matching and Localization

Given a test image, the 2D interest points are detected and their
SIFT descriptors are computed. 2D-to-3D matching refers to finding
the correspondences between the 2D points in the test image and the
3D points in the compressed model. The camera position can then be
estimated based on these correspondences. In our approach, a
2D-to-3D correspondence is accepted if the first and second nearest
neighbors in descriptor space pass the ratio test with a threshold
(set to 0.7 in our system). To speed up matching, a prioritized
correspondence search [30] is applied. After finding the
correspondences $\{(p_i, P_i)\}$, $i = 1, \ldots, N_c$, where $p_i$
is the 2D feature point corresponding to $P_i$ and $N_c$ is the
number of correspondences, we run the 6-point DLT algorithm [9] with
RANSAC for ego-positioning of the camera.

Figure 6: An overview of our model update algorithm.
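
The two steps above, the ratio test and DLT pose estimation, can be illustrated with the following Python sketch. It is a simplified stand-in under several assumptions: matching is done by brute force instead of the prioritized search of [30], the DLT is written without coordinate normalization and without the surrounding RANSAC loop, and all function names are hypothetical.

```python
import numpy as np

def ratio_test(query_desc, model_desc, ratio=0.7):
    """Return (query index, model index) pairs whose nearest neighbor is distinctive."""
    matches = []
    for qi, q in enumerate(query_desc):
        dist = np.linalg.norm(model_desc - q, axis=1)
        first, second = np.argsort(dist)[:2]
        if dist[first] < ratio * dist[second]:
            matches.append((qi, int(first)))
    return matches

def dlt_projection_matrix(points_2d, points_3d):
    """Estimate a 3x4 projection matrix from at least 6 2D-to-3D correspondences."""
    rows = []
    for (u, v), X in zip(points_2d, points_3d):
        Xh = np.append(X, 1.0)                       # homogeneous 3D point
        rows.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 4)                      # right singular vector of the smallest singular value

def camera_center(P):
    """The camera position C is the right null vector of P, i.e. P @ [C, 1] = 0."""
    _, _, vt = np.linalg.svd(P)
    C = vt[-1]
    return C[:3] / C[3]
```

In a full pipeline, `dlt_projection_matrix` would be called inside a RANSAC loop on random 6-point subsets of the accepted matches, and the hypothesis with the most inliers would give the camera position through `camera_center`.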

4.4 Model Update 

Because outdoor scenes change over time and all vision-based
approaches are sensitive to scene changes, how to update the model
remains an issue. In practice, the lighting conditions, the weather,
and even the structure of the scene can change. Previous works
assume that an up-to-date model has been constructed and do not
handle the update problem. We propose an algorithm to tackle this
problem.

An overview is shown in Figure 6. Instead of using only one 3D point
cloud model, our system keeps a model pool for every local area. The
pool contains multiple 3D point cloud models of the same area
constructed in different sessions. Notice that we do not merge more
observations into a single model, because doing so would lead to
more ambiguity in feature matching and further enlarge the model
size. That is why we use multiple small models. Furthermore, only
one active model is selected and transmitted in our model update
method. The method has two main components: model verification and
model swapping. In the ego-positioning phase, the model verification
component verifies whether the input image can be registered well
with the model by the following criterion,

\[
(N_c > T_1) \ \text{and} \ (N_I / N_c > T_2), \tag{7}
\]

where $N_c$ is the number of correspondences and $N_I$ is the number
of inlier correspondences found by the pose-estimation algorithm
[9]. $T_1$ and $T_2$ are learned from the training data and set to
50 and 50%, respectively, in our experiments.
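
Criterion (7) amounts to a two-threshold check, sketched below in Python. The function name is hypothetical; the default thresholds are the values reported above.

```python
def is_registered(num_correspondences, num_inliers, t1=50, t2=0.5):
    """Criterion (7): enough correspondences and a high enough inlier ratio."""
    if num_correspondences <= t1:
        return False                      # too few 2D-to-3D correspondences
    return num_inliers / num_correspondences > t2

examples = [
    is_registered(120, 90),   # many correspondences, 75% inliers
    is_registered(40, 40),    # all inliers, but too few correspondences
    is_registered(100, 50),   # inlier ratio exactly 50% does not exceed T2
]
```

Both inequalities are strict, so an image with exactly 50 correspondences or an inlier ratio of exactly 50% is not counted as registered.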

If multiple observations indicate that the currently active model is
invalid, the model swapping component is activated. First, it
evaluates all of the models in the model pool by (7) with the images
collected in this session. Second, the model with the highest score
(obtained based on (7)) is selected as the new active model if its
score is higher than a threshold. Third, if all of the model scores
are below the threshold, a new model is constructed from the images
collected in this session to replace the active model. Fourth, if a
model in the model pool has not been used for a long time, it is
removed to avoid unlimited growth of the model pool.

Table 1: Positioning error (cm) of single still images, and number
of points and model size after model compression. Red font and blue
font show the minimum and second minimum mean, stdev., or model size
of the four compression methods, respectively.

Model                    Metric      Scene #1   Scene #2   Scene #3   Scene #4
Original model           Mean        22.9       21.6       35.9       33.7
                         Stdev.      8.1        10.8       15.6       9.8
                         # points    187,572    53,568     33,190     85,447
                         Size (MB)   105.1      21.5       12.8       26.4
Compressed by            Mean        68.7       38.2       30.9       40.0
baseline method          Stdev.      143.5      49.2       40.5       38.3
                         # points    8,397      4,329      3,268      8,263
                         Size (MB)   10.2       2.5        1.5        5.4
Compressed by [17]       Mean        21.3       38.2       46.24      42.9
                         Stdev.      7.9        20.1       27.1       72.8
                         # points    8,580      5,101      3,345      8,272
                         Size (MB)   17.8       3.0        1.6        4.8
Compressed by [3]        Mean        24.7       27.7       41.1       40.1
                         Stdev.      19.4       17.5       23.5       48.5
                         # points    8,519      4,557      3,149      8,166
                         Size (MB)   19.9       2.7        1.7        5.0
Compressed by            Mean        25.9       24.9       39.0       33.2
our method               Stdev.      11.7       12.9       21.8       14.0
                         # points    7,781      4,351      3,228      8,196
                         Size (MB)   14.4       2.2        1.5        4.3

Our experiments show that a few models (8 models for daytime) are
sufficient to represent one scene for positioning even when the
scene changes over a long period of time. This indicates that
constructing a new model is rarely necessary and that the proposed
model swapping strategy works well in real applications.
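
The model swapping decision can be sketched as follows. The paper states that the score is obtained based on (7) but does not give its exact form, so this sketch assumes the score of a model is the fraction of session images that pass (7) against it; the threshold value and all names are likewise hypothetical.

```python
def select_active_model(pool_results, score_threshold=0.5):
    """Pick the pool model that registers the current session best.

    pool_results maps a model id to a list of booleans, one per session image,
    telling whether that image passed criterion (7) against the model.
    Returns the id of the best model, or None if a new model must be built.
    """
    best_id, best_score = None, -1.0
    for model_id, flags in pool_results.items():
        # Assumed score: share of session images registered by this model.
        score = sum(flags) / len(flags) if flags else 0.0
        if score > best_score:
            best_id, best_score = model_id, score
    return best_id if best_score > score_threshold else None

chosen = select_active_model({
    "sunny_morning": [True, False, False],
    "cloudy_noon": [True, True, False],
})
rebuild = select_active_model({"sunny_morning": [False, False]})
```

A return value of `None` corresponds to the third step above, in which a new model is constructed from the images of the current session.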

5. EXPERIMENTAL RESULTS

In this section, we first evaluate the positioning of a single still
image with both local and up-to-date models in four scenes. Second,
video sequences are tested. The results show larger deviations than
those of still images because of occlusions and motion blur, but the
deviations can be compensated for by applying temporal smoothing.
Third, a dataset of session data with more than 14,000 images from
106 sessions is constructed and released. Finally, the proposed
model update algorithm is evaluated.

The ground truth positions in our experiments are all measured
manually in the real scenes. Unlike most recent works [19, 25, 34],
which use the results generated by offline structure-from-motion
algorithms as the ground truth, all test images have ground truth in
our setting. In the setup of those works, unregistered images are
skipped. However, we have found that a number of images fail to be
registered in our experiments, particularly those taken at night.

5.1 Positioning Evaluation of Single Still Image

To show that a local and up-to-date model can lead to sub-meter
positioning accuracy, we test the positioning algorithm in four
scenes. Figure 7 shows the scenes, the constructed models, and the
compressed models generated by our algorithm. The models are
reconstructed from 400, 179, 146, and 341 training images,
respectively. The test images are taken in the same session as the
training images. There are 50, 24, 28, and 88 test images,
respectively, and Figure 8 shows some of them. Table 1 reports the
positioning results and compares model compression with different
methods: the baseline method keeps the 5% (for Scene #1) or 10% (for
the other scenes) most frequently seen points on each plane or line;
the method of Li et al. [17] solves a set k-cover problem (k = 720,
500, 250, and 200 for the four scenes, respectively); the method of
Cao and Snavely [3] solves a probabilistic k-cover problem (k = 430,
300, 250, and 200, respectively); and our method formulates a
weighted set k-cover problem (k = 500, 300, 370, and 185,
respectively) as described in Section 4.2. We select k for each
method so that the numbers of points after model compression are
similar across methods for a fair comparison.

Figure 7: Four scenes: Scene #1, Scene #2, Scene #3, and Scene #4, from left to right. The first
row shows images of the scenes, the second row shows the constructed models, and the third row
shows the models compressed by our method.

Figure 8: Some examples of test images in the four scenes.

The results show that sub-meter accuracy can be achieved even when
model compression is performed for a local and up-to-date model.
Furthermore, our compression method performs favorably against other
methods in both positioning accuracy and model size reduction
because it preserves the structure information better, which leads
to more stable positioning results. Table 1 shows that our method
performs slightly worse than [3, 17] in Scene #1, but it reduces the
model size more and leads to the best performance in other scenes.
Notice that both methods [3, 17] perform even worse than the
baseline method in Scene #4, and all of them have large standard
deviations (about three times larger than ours). Moreover, our
method also outperforms the other methods in model size reduction.
Figure 9 shows the compressed models of Scene #4; our method keeps
more structural information than the other methods.

We do not focus on real-time processing in this paper. The average
execution time for one image with a resolution of 900x600 on a PC
with an Intel Core i7-4770K processor and 16 GB of RAM is 1.4089
seconds (including 1.274 seconds for SIFT feature extraction, 0.1319
seconds for feature matching, and 0.0003 seconds for RANSAC pose
estimation). Most of the time is spent on SIFT feature extraction,
which can be accelerated by GPU or multi-threaded implementations in
the future.

5.2 Positioning Evaluation of Video Sequence 

We then test two video sequences in Scene #4. For each test,
we recorded videos from two cameras at 10 frames per second
(fps). One is taken by a camera on the third floor of a nearby
building, as shown in the bottom-left image of Figure 10(b).
It is used for display and visual verification. We click the
position of the man’s foot in each image and then transform
the image coordinates to the ground plane to obtain the ground
truth position. The other is taken by a smartphone, as shown
in Figure 10(a) and the bottom-right image of Figure 10(b),
and is used for vision-based positioning. The upper-left image in

Figure 9: The model compression results of (a) the baseline
method, (b) Li’s method [17], (c) Cao’s method [3], and
(d) our method in Scene #4.


Table 2: Positioning error (cm) of video sequences.

Video sequence                  #1       #2
Frame number                    200      281
Single still image    Mean      60.1     88.4
                      Stdev.    53.8     116.7
Temporal smoothing    Mean      37.2     41.8
                      Stdev.    18.3     26.3



Figure 10: (a) The camera setup on a person for taking video
sequences. (b) An example of video evaluation results.
Bottom-right: image from the smartphone; bottom-left: image
from the camera on the third floor; upper-left: positioning
results.



Figure 10(b) shows the estimation results, where the red and
green circles are the positioning results from still images
and the results after temporal smoothing with a Kalman
filter [13], respectively. More details can be seen in the demo
video: PositioningDemo.mp4. Table 2 shows the quantitative
results. Due to occlusion or motion blur in the video,
still-image positioning sometimes produces outliers or noisy
estimates, but these are compensated well by temporal
smoothing, which reduces the mean error from 60.1 cm to
37.2 cm in sequence #1 and from 88.4 cm to 41.8 cm in
sequence #2.
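
The temporal smoothing step can be sketched as a constant-velocity
Kalman filter applied independently to each ground-plane coordinate.
This is a minimal illustration, not the configuration used in the
experiments: the state model, the noise variances, and the names
`ConstantVelocityKalman1D` and `smooth_track` are assumptions. Only
the 10 fps frame rate (dt = 0.1 s) comes from the text.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state [position, velocity]."""

    def __init__(self, dt, process_var, meas_var):
        self.dt = dt
        self.q = process_var  # assumed acceleration noise variance
        self.r = meas_var     # assumed measurement noise variance
        self.x = None         # state, initialised by the first measurement
        self.P = [[meas_var, 0.0], [0.0, 1.0]]  # assumed initial covariance

    def update(self, z):
        if self.x is None:
            self.x = [z, 0.0]
            return z
        dt, q = self.dt, self.q
        p, v = self.x
        (a, b), (c, d) = self.P
        # Predict: x = F x, P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        p_pred = p + dt * v
        a_pred = a + dt * (b + c) + dt * dt * d + q * dt**4 / 4
        b_pred = b + dt * d + q * dt**3 / 2
        c_pred = c + dt * d + q * dt**3 / 2
        d_pred = d + q * dt**2
        # Correct with the position measurement z (H = [1, 0]).
        s = a_pred + self.r
        k0, k1 = a_pred / s, c_pred / s
        resid = z - p_pred
        self.x = [p_pred + k0 * resid, v + k1 * resid]
        self.P = [
            [(1 - k0) * a_pred, (1 - k0) * b_pred],
            [c_pred - k1 * a_pred, d_pred - k1 * b_pred],
        ]
        return self.x[0]


def smooth_track(points, dt=0.1, process_var=1.0, meas_var=0.25):
    """Smooth a list of (x, y) positions, one filter per axis."""
    fx = ConstantVelocityKalman1D(dt, process_var, meas_var)
    fy = ConstantVelocityKalman1D(dt, process_var, meas_var)
    return [(fx.update(x), fy.update(y)) for x, y in points]


# A stationary track with one outlier in x at frame 5.
track = [(0.0, 0.0)] * 5 + [(3.0, 0.0)] + [(0.0, 0.0)] * 5
smoothed = smooth_track(track)
print(smoothed[5])
```

The outlier is pulled toward the predicted position rather than
followed, which is the behaviour that compensates for occasional bad
still-image estimates.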

We also evaluate video sequences taken by a dash camera on a
car. Figure 11 shows some results, and more can be seen in the
demo video: PositioningDemo.mp4. We do not have quantitative
results for these sequences because it is hard to determine
the ground truth positions of the dash camera, but the demo
video shows that the estimated positions lie well on the car,
indicating high accuracy. Notice that the test video sequences
were taken on different days and that the model update
algorithm was applied in these video evaluations. More results
of model update are shown in Section 5.4.

5.3 Long-Term Positioning Dataset 

To evaluate the model update, we build a dataset of long-term
session data. It spans more than one month and contains 106
sessions and 14,275 images, 9,720 of which have manually
measured ground truth positions. To the best of our knowledge,
no existing positioning dataset provides such session data for
evaluation. The dataset covers sunny, cloudy, rainy, and night
conditions, among others. The session distribution and some
sample images are shown in Figure 12.

5.4 Positioning Evaluation with Model Update 

Figure 11: Examples of video evaluation results. Right column:
image from the dash camera; left column: image from the camera
on the third floor, where the red and green circles are the
positioning results from still images and the results after
temporal smoothing with a Kalman filter, respectively.

We then apply our model update algorithm to the long-term
dataset. At the end of the first month, a total of 21 distinct
models have been built in the model pool, 8 for daytime and 13
for night. Figure 13 shows some of them. This indicates that a
few models are sufficient for positioning in the daytime,
whereas images at night are hard to register. Table 3 presents
some of the evaluation results for other months. Note that all
models are selected automatically from the first ten images of
each session using our method in Section 4.4. The results show
that our method can select a suitable model and that
positioning improves considerably when model update is applied
instead of a fixed model; for example, there is a 100-fold
improvement at M3/4 14:30 (red font in the table). The fixed
model performs even worse in some sessions, where images cannot
be registered at all (e.g., M2/4 17:30). Furthermore, we find
that positioning in night scenes is still challenging and is
not fully solved in this paper. Night scenes (blue font in the
table) with sufficient lighting, e.g., M2/9 18:30, are handled
well, but the performance degrades under poor lighting
conditions (e.g., M2/8 22:30). We will devote more effort to
night scene positioning in the future.
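
The automatic model selection from the first images of a session
might be sketched as follows. The actual criterion is the one defined
in Section 4.4 and is not restated here; this sketch simply assumes
that each candidate model is scored by its mean number of RANSAC
inliers over the probe images and that a threshold decides whether
any model fits. The function name, the scoring rule, and the
threshold value are all hypothetical.

```python
def select_model(inlier_counts, min_mean_inliers=12.0):
    """Pick the model with the highest mean inlier count over probe images.

    inlier_counts: dict model_id -> list of inlier counts, one per probe image
    Returns the best model id, or None if no model reaches min_mean_inliers
    (hypothetical signal that no model in the pool fits the session).
    """
    best_id, best_score = None, float("-inf")
    for model_id in sorted(inlier_counts):  # sorted for deterministic ties
        counts = inlier_counts[model_id]
        score = sum(counts) / len(counts) if counts else 0.0
        if score > best_score:
            best_id, best_score = model_id, score
    if best_score < min_mean_inliers:
        return None
    return best_id


# Made-up inlier counts for three probe images and two candidate models.
probe_results = {
    "daytime_model": [40, 35, 50],
    "night_model": [3, 0, 5],
}
print(select_model(probe_results))
```

Returning `None` when every model scores poorly leaves room for the
caller to treat the session as a new scene condition.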

5.5 Positioning for Augmented Reality Display 

Figure 12: The long-term positioning dataset: (a) session
distribution with the number of images, and (b) some example
images.

Figure 13: Some models in the model pool after learning.

To demonstrate that the proposed positioning method can help
in developing AR displays, we load the point cloud model of
Scene #4 into Maya and draw the corresponding building
structures and AR information on it, as shown in Figure 14(a).
We then load the camera pose of each image estimated by our
method into Maya to render the projection image. Figure 14(b)
is the result of projecting the point cloud onto an image, and
Figure 14(c) shows the result of the AR display. As we can
see, the AR information is well aligned with the image, which
suggests that the highly accurate estimation provided by our
method is helpful for future AR applications.
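
The projection of the point cloud onto an image with an estimated
camera pose can be illustrated with a standard pinhole model. This is
a generic sketch rather than the Maya rendering pipeline used above.
The function name, the world-to-camera pose convention, and the
intrinsic values (focal lengths and a principal point at the centre
of a 900x600 image) are assumptions chosen for illustration.

```python
def project_points(points, R, t, fx, fy, cx, cy):
    """Project 3D world points to pixel coordinates with a pinhole camera.

    R (3x3 nested lists) and t (length 3) map world to camera coordinates:
    X_cam = R * X_world + t. Points at or behind the camera plane are skipped.
    """
    pixels = []
    for X in points:
        xc, yc, zc = (
            sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)
        )
        if zc <= 0:
            continue  # not in front of the camera
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels


# Identity pose and assumed intrinsics for a 900x600 image.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
cloud = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, -1.0)]
print(project_points(cloud, R, t, fx=600.0, fy=600.0, cx=450.0, cy=300.0))
```

Errors in the estimated pose (R, t) shift every projected point, so
the visual alignment of the overlay is a direct check on positioning
accuracy.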

6. CONCLUSIONS

We have proposed a novel algorithm for the IoT framework to
make vision-based positioning practical in real situations. We
formulate model compression as a weighted k-cover problem to
condense the model while preserving the scene structure.
Experimental results show that sub-meter positioning accuracy
can be achieved with the compressed models in both image and
video tests. We also present a model update method for
long-term data. The released long-term positioning dataset
will support future progress in vision-based positioning that
adapts to scene changes.